Multilingual Language Identification: ALTW 2010 Shared Task Data

Authors

Timothy Baldwin

Marco Lui

Abstract

While there has traditionally been strong interest in the task of monolingual language identification, research on multilingual language identification is underrepresented in the literature, partly due to a lack of standardised datasets. This paper describes an artificially-generated dataset for multilingual language identification, as used in the 2010 Australasian Language Technology Workshop shared task.

Similar resources

Hierarchical classification for Multilingual Language Identification and Named Entity Recognition

This paper describes the approach for Subtask-1 of the FIRE2015 Shared Task on Mixed Script Information Retrieval. The subtask involved multilingual language identification (including mixed words and anomalous foreign words), named entity recognition (NER) and subclassification. The proposed methodology starts with cleaning the data and then extracting structural and contextual features from th...

Automatic Detection and Language Identification of Multilingual Documents

Language identification is the task of automatically detecting the language(s) present in a document based on the content of the document. In this work, we address the problem of detecting documents that contain text from more than one language (multilingual documents). We introduce a method that is able to detect that a document is multilingual, identify the languages present, and estimate the...

Word Level Language Identification in Online Multilingual Communication

Multilingual speakers switch between languages in online and spoken communication. Analyses of large scale multilingual data require automatic language identification at the word level. For our experiments with multilingual online discussions, we first tag the language of individual words using language models and dictionaries. Secondly, we incorporate context to improve the performance. We ach...

Data-Driven Dependency Parsing across Languages and Domains: Perspectives from the CoNLL-2007 Shared task

The Conference on Computational Natural Language Learning features a shared task, in which participants train and test their learning systems on the same data sets. In 2007, as in 2006, the shared task has been devoted to dependency parsing, this year with both a multilingual track and a domain adaptation track. In this paper, I summarize the main findings from the 2007 shared task and try to i...

CoNLL-X Shared Task on Multilingual Dependency Parsing

Each year the Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their systems on exactly the same data sets, in order to better compare systems. The tenth CoNLL (CoNLL-X) saw a shared task on Multilingual Dependency Parsing. In this paper, we describe how treebanks for 13 languages were converted into the same dependency ...

Publication date: 2010
